Extracts BitString to code points array to BitString class#721

Draft
mward-sudo wants to merge 7 commits into bartblast:dev from mward-sudo:02-19-extracts_bitstring_to_code_points_array_to_bitstring_class

@mward-sudo (Contributor) commented Feb 19, 2026

Closes #720

Dependencies

Please note that this PR includes commits from the PR(s) it is dependent upon. Once the dependent PR(s) are merged to the dev branch, then this PR will be rebased and will then only contain its own commits. This PR will remain in draft until that point.

@coderabbitai Do not review this code while the PR is in draft.

Summary by CodeRabbit

  • Refactor

    • Standardized UTF‑8 validation and handling across character processing.
    • Improved handling for invalid or truncated UTF‑8 sequences and more reliable codepoint conversion.
    • Enhanced Unicode normalization paths (NFC, NFD, NFKC, NFKD).
  • Tests

    • Expanded test coverage for UTF‑8 decoding, validation, truncation detection, and codepoint conversion.

@coderabbitai (bot) commented Feb 19, 2026

Walkthrough

Centralizes UTF‑8 validation and decoding into Bitstring by adding static UTF‑8 helpers (decoding, validation, truncation detection, and codepoint conversion) and refactors assets/js/erlang/unicode.mjs to use these helpers; adds extensive tests for the new Bitstring behavior.

Changes

| Cohort / File(s) | Summary |
|---|---|
| UTF‑8 Utilities in Bitstring: assets/js/bitstring.mjs | Added static UTF‑8 helpers: decodeUtf8CodePoint, getValidUtf8Length, isValidUtf8CodePoint, isValidUtf8ContinuationByte, isValidUtf8Sequence, isTruncatedUtf8Sequence, toCodepointArray (plus related sequence-length logic). |
| Unicode module refactor: assets/js/erlang/unicode.mjs | Removed in-file UTF‑8 helpers and replaced usages with Bitstring utilities across character conversion and normalization flows (NFC/NFD/NFKC/NFKD); internal logic changed, public APIs unchanged. |
| Tests for Bitstring UTF‑8: test/javascript/bitstring_test.mjs | Added comprehensive tests covering decoding, sequence-length, valid-length detection, truncation, continuation checks, codepoint array conversion, and edge cases (overlongs, surrogates, out-of-range, truncation). |
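
To make the validation helpers in the table more concrete, here is a minimal sketch of how sequence-length detection, sequence validation, and valid-prefix measurement can fit together. The function names (`utf8SequenceLength`, `isValidSequenceAt`, `validUtf8PrefixLength`) are hypothetical stand-ins, not the PR's actual methods, and the real implementations in assets/js/bitstring.mjs may differ.

```javascript
// Hypothetical stand-ins for the helpers listed above; not the PR's code.

// Expected sequence length for a lead byte, or 0 if it cannot start a sequence.
function utf8SequenceLength(leadByte) {
  if (leadByte <= 0x7f) return 1;
  if (leadByte >= 0xc2 && leadByte <= 0xdf) return 2;
  if (leadByte >= 0xe0 && leadByte <= 0xef) return 3;
  if (leadByte >= 0xf0 && leadByte <= 0xf4) return 4;
  return 0; // continuation byte, overlong lead (0xC0/0xC1), or above 0xF4
}

function isContinuationByte(byte) {
  return (byte & 0xc0) === 0x80;
}

// True if a complete, well-formed UTF-8 sequence starts at `start`.
function isValidSequenceAt(bytes, start) {
  const length = utf8SequenceLength(bytes[start]);
  if (length === 0 || start + length > bytes.length) return false;

  for (let i = 1; i < length; i++) {
    if (!isContinuationByte(bytes[start + i])) return false;
  }

  // Second-byte range checks reject overlong forms, surrogates,
  // and code points above U+10FFFF.
  const lead = bytes[start];
  const second = bytes[start + 1];
  if (lead === 0xe0 && second < 0xa0) return false; // overlong 3-byte form
  if (lead === 0xed && second > 0x9f) return false; // surrogate range
  if (lead === 0xf0 && second < 0x90) return false; // overlong 4-byte form
  if (lead === 0xf4 && second > 0x8f) return false; // above U+10FFFF

  return true;
}

// Number of leading bytes that form valid UTF-8.
function validUtf8PrefixLength(bytes) {
  let pos = 0;
  while (pos < bytes.length && isValidSequenceAt(bytes, pos)) {
    pos += utf8SequenceLength(bytes[pos]);
  }
  return pos;
}
```

With these definitions, `validUtf8PrefixLength(Uint8Array.of(0x61, 0xc3))` reports only the leading `a` as valid, because the two-byte sequence that follows is cut short.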

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~25 minutes

Possibly related PRs

Suggested reviewers

  • bartblast
🚥 Pre-merge checks | ✅ 5
✅ Passed checks (5 passed)

| Check name | Status | Explanation |
|---|---|---|
| Description Check | ✅ Passed | Check skipped - CodeRabbit’s high-level summary is enabled. |
| Title check | ✅ Passed | The title accurately summarizes the main change: extracting BitString-to-codepoints-array functionality into the BitString class as a static method. |
| Linked Issues check | ✅ Passed | The PR fulfills issue #720 by successfully extracting toCodepointArray(bitstring) as a static BitString class method, along with supporting UTF-8 validation utilities. |
| Out of Scope Changes check | ✅ Passed | All changes are in scope: BitString.mjs additions (UTF-8 utilities and toCodepointArray), unicode.mjs refactoring to use new BitString methods, and comprehensive test coverage. |
| Docstring Coverage | ✅ Passed | No functions found in the changed files to evaluate docstring coverage. Skipping docstring coverage check. |

@coderabbitai (bot) left a comment

Actionable comments posted: 2

🧹 Nitpick comments (2)
assets/js/bitstring.mjs (1)

255-269: Consider hoisting lookup objects to module/class level.

firstByteMasks (Line 259) and minValueForLength (Line 623) are allocated on every call. Since these are constants, they could be static class fields or module-level constants to avoid repeated allocation.

♻️ Suggested refactor
 export default class Bitstring {
   static #decoder = ERTS.utf8Decoder;
   static #encoder = new TextEncoder("utf-8");
+  static #utf8FirstByteMasks = {2: 0x1f, 3: 0x0f, 4: 0x07};
+  static #utf8MinCodePointForLength = {1: 0, 2: 0x80, 3: 0x800, 4: 0x10000};

Then reference $.#utf8FirstByteMasks[length] and $.#utf8MinCodePointForLength[encodingLength] respectively.

Also applies to: 617-633
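
As a rough illustration of the hoisting suggestion, the sketch below keeps both lookup tables as private static fields so they are allocated once when the class is defined. The class name `Utf8Decoder` and its method names are hypothetical; only the table contents come from the suggested refactor, and the decoding loop is an assumption about how such masks are typically used rather than the PR's actual code.

```javascript
// Hypothetical class for illustration; not the Bitstring class from the PR.
class Utf8Decoder {
  // Allocated once at class definition time instead of on every call.
  static #firstByteMasks = {2: 0x1f, 3: 0x0f, 4: 0x07};
  static #minCodePointForLength = {1: 0, 2: 0x80, 3: 0x800, 4: 0x10000};

  // Decodes `length` bytes starting at `start`. Assumes the caller has already
  // checked that the bytes are a lead byte followed by continuation bytes.
  static decodeCodePoint(bytes, start, length) {
    if (length === 1) return bytes[start];

    // Keep only the payload bits of the lead byte.
    let codePoint = bytes[start] & Utf8Decoder.#firstByteMasks[length];

    // Each continuation byte contributes its low 6 bits.
    for (let i = 1; i < length; i++) {
      codePoint = (codePoint << 6) | (bytes[start + i] & 0x3f);
    }

    return codePoint;
  }

  // A code point is overlong if a shorter encoding could have represented it.
  static isOverlong(codePoint, length) {
    return codePoint < Utf8Decoder.#minCodePointForLength[length];
  }
}
```

For example, `Utf8Decoder.decodeCodePoint(Uint8Array.of(0xe2, 0x82, 0xac), 0, 3)` yields `0x20ac` (the euro sign), and the same tables are reused on every subsequent call.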

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@assets/js/bitstring.mjs` around lines 255 - 269, The function
decodeUtf8CodePoint allocates the lookup object firstByteMasks on every call
(and similarly minValueForLength elsewhere), so hoist these constant maps to
module/class scope as static fields to avoid repeated allocation; create
constants (e.g. utf8FirstByteMasks and utf8MinCodePointForLength) at the top of
the module or as private static fields on the Bitstring class and replace local
usages of firstByteMasks and minValueForLength with references to those
static/module constants (e.g., Bitstring.#utf8FirstByteMasks[length] or the
module-level utf8FirstByteMasks[length]).
assets/js/erlang/unicode.mjs (1)

271-354: Consider extracting a shared normalization handler parameterized by form.

The characters_to_nfc_binary/1, characters_to_nfd_binary/1, characters_to_nfkc_binary/1, and characters_to_nfkd_binary/1 functions are structurally identical — they differ only in the normalization form string ("NFC", "NFD", "NFKC", "NFKD"). The handleInvalidUtf8, handleConversionError, and validateListRest helpers are duplicated verbatim across all four.

Since this PR already modifies the handleInvalidUtf8 in each variant (switching to Bitstring.getValidUtf8Length), it would be a natural time to extract a shared factory:

♻️ Sketch
const makeNormalizationBinaryFn = (form) => (data) => {
  const validateListRest = (rest) => { /* shared */ };
  const handleConversionError = (tag, prefix, rest) => {
    // ...textPrefix.normalize(form)...
  };
  const handleInvalidUtf8 = (bytes) => {
    // ...validText.normalize(form)...
  };
  // ...main logic identical...
};

"characters_to_nfc_binary/1": makeNormalizationBinaryFn("NFC"),
"characters_to_nfd_binary/1": makeNormalizationBinaryFn("NFD"),
"characters_to_nfkc_binary/1": makeNormalizationBinaryFn("NFKC"),
"characters_to_nfkd_binary/1": makeNormalizationBinaryFn("NFKD"),

Also applies to: 501-583, 587-668, 672-754
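
The factory idea can be tried out in isolation with a heavily simplified, runnable version. This sketch assumes plain `Uint8Array` input and output and leaves out the Erlang list handling, error tuples, and Hologram's `Type` and `Bitstring` helpers entirely; `makeNormalizer` and `normalizers` are hypothetical names.

```javascript
// Simplified illustration: one factory, parameterized by normalization form.
// Works on Uint8Array in and out; the real functions operate on Hologram
// terms and return error tuples, which this sketch does not model.
const makeNormalizer = (form) => (bytes) => {
  // `fatal: true` makes decode() throw a TypeError on invalid UTF-8
  // instead of substituting U+FFFD.
  const text = new TextDecoder("utf-8", {fatal: true}).decode(bytes);
  return new TextEncoder().encode(text.normalize(form));
};

const normalizers = {
  "characters_to_nfc_binary/1": makeNormalizer("NFC"),
  "characters_to_nfd_binary/1": makeNormalizer("NFD"),
  "characters_to_nfkc_binary/1": makeNormalizer("NFKC"),
  "characters_to_nfkd_binary/1": makeNormalizer("NFKD"),
};
```

Each entry shares the same body and differs only in the captured `form`, which is the property the review comment is after: a fix to the shared logic lands in all four variants at once.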

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@assets/js/erlang/unicode.mjs` around lines 271 - 354, The four functions
characters_to_nfc_binary/1, characters_to_nfd_binary/1,
characters_to_nfkc_binary/1, and characters_to_nfkd_binary/1 duplicate the same
helpers (validateListRest, handleConversionError, handleInvalidUtf8) and differ
only by the normalization form string; extract a factory like
makeNormalizationBinaryFn(form) that returns the function implementing the
shared logic and uses the form when calling String.prototype.normalize (i.e.,
replace hardcoded "NFC" with the parameter), then replace each of the four
exported entries with makeNormalizationBinaryFn("NFC"),
makeNormalizationBinaryFn("NFD"), makeNormalizationBinaryFn("NFKC"), and
makeNormalizationBinaryFn("NFKD") respectively, keeping references to
Erlang_Unicode["characters_to_binary/3"], Bitstring.toText,
Bitstring.getValidUtf8Length, Type.bitstring, and Type.tuple unchanged inside
the factory.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@assets/js/bitstring.mjs`:
- Around line 659-678: isTruncatedUtf8Sequence can return true when start >=
bytes.length because bytes[start] is undefined; add an early guard in
isTruncatedUtf8Sequence to return false if start is out of range (start < 0 or
start >= bytes.length) so you never read bytes[start]; keep the rest of the
logic intact (using $.getUtf8SequenceLength and $.isValidUtf8ContinuationByte)
after this check to correctly detect only true truncated sequences.
- Around line 832-839: toCodepointArray currently calls maybeSetTextFromBytes
which can set bitstring.text to false on UTF-8 decode failure, and then
Array.from(false, ...) silently returns []; change to explicitly guard after
maybeSetTextFromBytes in static toCodepointArray(bitstring): if bitstring.text
=== false (or not a string) throw a descriptive TypeError (or return null per
API decision) indicating invalid UTF-8 bitstring so callers don't get an empty
array silently; reference the toCodepointArray method and maybeSetTextFromBytes
when adding the check and the error message.
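
Both inline findings can be reproduced outside the codebase. The sketch below is a hypothetical, simplified take on the two guards: `leadByteLength`, `isTruncatedSequence`, and `codePointsFromText` are illustrative names, and the second function takes the decoded text directly instead of a bitstring object, so it only approximates what `toCodepointArray` would do after `maybeSetTextFromBytes`.

```javascript
// Hypothetical simplified helpers; not the PR's implementation.

// Expected sequence length for a lead byte, or 0 if it cannot start a sequence.
function leadByteLength(leadByte) {
  if (leadByte <= 0x7f) return 1;
  if (leadByte >= 0xc2 && leadByte <= 0xdf) return 2;
  if (leadByte >= 0xe0 && leadByte <= 0xef) return 3;
  if (leadByte >= 0xf0 && leadByte <= 0xf4) return 4;
  return 0;
}

function isTruncatedSequence(bytes, start) {
  // Early guard: no sequence starts at an out-of-range offset,
  // so bytes[start] is never read as undefined.
  if (start < 0 || start >= bytes.length) return false;

  const length = leadByteLength(bytes[start]);
  if (length === 0) return false; // invalid lead byte, not a truncation
  if (start + length <= bytes.length) return false; // all bytes are present

  // The bytes that are present must still be valid continuation bytes.
  for (let i = start + 1; i < bytes.length; i++) {
    if ((bytes[i] & 0xc0) !== 0x80) return false;
  }

  return true;
}

// `text` stands in for bitstring.text, which the review says may be `false`
// when UTF-8 decoding failed.
function codePointsFromText(text) {
  if (typeof text !== "string") {
    throw new TypeError("cannot build code point array: invalid UTF-8 bitstring");
  }
  // Iterating a string yields whole code points, not UTF-16 code units.
  return Array.from(text, (char) => char.codePointAt(0));
}
```

The explicit type check matters because `Array.from(false, fn)` does not throw: `false` is treated as an object with no length, so the call quietly produces an empty array.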


@mward-sudo (Contributor, Author) commented

@coderabbitai do not review this code again while it is in draft. Also, please change the summary to include only changes from the latest commit in this PR.

@coderabbitai (bot) commented Feb 19, 2026

@mward-sudo, understood! I will not review this PR while it remains in draft status.

I'll regenerate the summary to include only the changes from the latest commit in this PR.

Please regenerate the summary to include only the changes from the latest commit in PR #721, excluding changes from the dependent PRs (#710, #715, #716, #717, #718, #719).


✏️ Learnings added
Learnt from: mward-sudo
URL: https://github.com/bartblast/hologram/pull/721

Timestamp: 2026-02-19T01:23:42.278Z
Learning: User mward-sudo prefers that draft PRs should not be automatically reviewed by CodeRabbit until the draft status is removed.

Note: Learnings are effective only in the context of similar code segments. To apply general coding guidelines, please configure review instructions. You can manage existing learnings in the app.

✅ Actions performed

Summary regeneration triggered.

Development

Successfully merging this pull request may close these issues.

Extract conversion of bitstring to code point array to Bitstring class
